According to Epsilon research, 80% of customers are more likely to do business with a company if it provides personalized service. Banking is no exception. The digitalization of everyday lives means that customers expect services to be delivered in a personalized and timely manner, often before they have even realized they need the service.
Santander Group aims to go a step beyond providing customers with financial services by predicting the amount, or value, of each customer's future transactions. The primary focus is on the digital delivery of financial services, reinforcing distribution and the overall customer experience in the new digital environment. The company strives to achieve this through big-data analytics platforms that leverage both financial and non-financial information, anticipating customer needs in a way that is more concrete, yet also simple and personal.
With so many choices for financial services, this need is greater now than ever before. Personalizing services at scale is the first step that Santander strives to get right.
A US bank used machine learning to study the discounts its private bankers were offering to customers. Bankers claimed that they offered discounts only to valuable customers and more than made up for them with other, high-margin business. The analytics showed something different: patterns of unnecessary discounts that could easily be corrected. After the unit adopted the changes, revenues rose by 8% within a few months.
A top consumer bank in Asia enjoyed a large market share but lagged behind its competitors in products per customer. It used advanced analytics to explore several sets of big data: customer demographics & key characteristics, products held, credit-card statements, transaction & point-of-sale data, online and mobile transfers & payments, and credit-bureau data. The bank discovered unsuspected similarities that allowed it to define 15,000 microsegments in its customer base. It then built a next-product-to-buy model that tripled the likelihood of a purchase.
Project Objectives:
Santander Group wants to predict the value of future customer transactions (the target column) in the test set with minimal error. The evaluation metric for this project is Root Mean Squared Logarithmic Error (RMSLE).
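For reference, the metric can be sketched as a minimal R helper (the function name and arguments are our own choice, not part of the provided materials):

```r
# Root Mean Squared Logarithmic Error -- the competition metric.
# log1p() computes log(1 + x), which keeps zero-valued predictions valid.
rmsle <- function(pred, actual) {
  sqrt(mean((log1p(pred) - log1p(actual))^2))
}

# Example: predictions off by roughly 10% on two transactions
rmsle(c(110, 220), c(100, 200))
```

Because the metric operates on log(1 + x), equal relative errors are penalized similarly regardless of transaction size, which suits a target spanning several orders of magnitude.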
To solve this challenge, we plan to follow the CRISP-DM process:
We are provided with an anonymized dataset containing numeric feature variables, the numeric target column, and a string ID column.
File descriptions:
transaction.data <- read.csv(file="train.csv", header=TRUE, sep=",")
test_non_zero_base<-read.csv("test.csv", header= TRUE, sep=",")
attach(transaction.data)
attach(test_non_zero_base)
options("scipen" = 999, "digits" = 10)
set_plot_dimensions <- function(width_choice, height_choice) {
options(repr.plot.width=width_choice, repr.plot.height=height_choice)
}
str(transaction.data, list.len = 10, vec.len = 5)
## 'data.frame': 4459 obs. of 4993 variables:
## $ ID : Factor w/ 4459 levels "000d6aaf2","000fbd867",..: 1 2 3 4 5 6 7 8 9 10 11 12 ...
## $ target : num 38000000 600000 10000000 2000000 14400000 2800000 164000 600000 979000 460000 1100000 16000000 ...
## $ X48df886f9: num 0 0 0 0 0 0 0 0 0 0 0 0 ...
## $ X0deb4b6a8: int 0 0 0 0 0 0 0 0 0 0 0 0 ...
## $ X34b15f335: num 0 0 0 0 0 0 0 0 0 0 0 0 ...
## $ a8cb14b00 : int 0 0 0 0 0 0 0 0 0 0 0 0 ...
## $ X2f0771a37: int 0 0 0 0 0 0 0 0 0 0 0 0 ...
## $ X30347e683: int 0 0 0 0 0 0 0 0 0 0 0 0 ...
## $ d08d1fbe3 : int 0 0 0 0 0 0 0 0 0 0 0 0 ...
## $ X6ee66e115: int 0 0 0 0 0 0 0 0 0 0 0 0 ...
## [list output truncated]
summary <- summary.data.frame(transaction.data)
summary[1:6, 1:10]
## ID target X48df886f9
## 000d6aaf2: 1 Min. : 30000 Min. : 0.00
## 000fbd867: 1 1st Qu.: 600000 1st Qu.: 0.00
## 0027d6b71: 1 Median : 2260000 Median : 0.00
## 0028cbf45: 1 Mean : 5944923 Mean : 14654.93
## 002a68644: 1 3rd Qu.: 8000000 3rd Qu.: 0.00
## 002dbeb22: 1 Max. :40000000 Max. :20000000.00
## X0deb4b6a8 X34b15f335 a8cb14b00
## Min. : 0.000 Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 0.000
## Median : 0.000 Median : 0.00 Median : 0.000
## Mean : 1390.895 Mean : 26722.45 Mean : 4530.164
## 3rd Qu.: 0.000 3rd Qu.: 0.00 3rd Qu.: 0.000
## Max. :4000000.000 Max. :20000000.00 Max. :14800000.000
## X2f0771a37 X30347e683 d08d1fbe3
## Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00
## Median : 0.00 Median : 0.00 Median : 0.00
## Mean : 26409.96 Mean : 30708.11 Mean : 16865.22
## 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.00
## Max. :100000000.00 Max. :20708000.00 Max. :40000000.00
## X6ee66e115
## Min. : 0.000
## 1st Qu.: 0.000
## Median : 0.000
## Mean : 4669.208
## 3rd Qu.: 0.000
## Max. :10400000.000
Preliminary observations:
Time-series nature - the dataset appears to be a time series in both dimensions, row-wise and column-wise.
Disguised meaning of the columns - each column seems to represent individual transaction amounts, possibly related to different transaction types.
library(DataExplorer)
library(ggplot2)
library(data.table)
library(dplyr)
library(plotly)
library(e1071)
library(tidyr)
library(purrr)
library(compare)
library(tidyverse)
library(caret)
library(leaps)
library(MASS)
library(mltools)
library(psych)
library(rpart)
library(rpart.plot)
library(arules)
library(arulesViz)
library(knitr)
library(randomForest)
First, we want to assess the data quality in terms of missing values and take a closer look at the target variable, its distribution, and summary statistics.
#plot_missing(transaction.data)
#transaction.data[!complete.cases(transaction.data),]
#sapply(transaction.data, function(x) sum(is.na(x)))
#Due to the size of the data set, commands above are difficult to print in the report
sum(is.na(transaction.data))
## [1] 0
ggplot(transaction.data,aes(x=target))+geom_histogram(fill="blue",bins=50)+scale_x_continuous(trans='log2')+ggtitle("Histogram Distribution of Target")
box_plot <- ggplot(transaction.data, aes(y= target)) +
geom_boxplot() +
ylab("Target") +
scale_y_continuous(trans='log2')+
ggtitle("Box Plot of Target")
box_plot
qqnorm(transaction.data$target,
datax = TRUE,
col = "red",
main = "Normal Q-Q Plot of target Distribution")
qqline(transaction.data$target,
col = "blue",
datax = TRUE)
(min(target))
## [1] 30000
(max(target))
## [1] 40000000
(target_lcutoff <- quantile(target,.25))
## 25%
## 600000
(target_ucutoff <- quantile(target,.75))
## 75%
## 8000000
(median(target))
## [1] 2260000
(mean(target))
## [1] 5944923.322
As we can see, there are no missing values. The target variable is not normally distributed, with several outliers that we will need to pay attention to during the Data Preparation stage. The mean is higher than the median, so the distribution is right-skewed. Also, looking at the min and max, the range is very wide.
Next, we dig deeper into the preliminary observations from the previous section. The broader hypothesis that we analyze is: columns and rows were originally time ordered and then shuffled for the competition.
num_rows / days_in_week : 4459 / 7 = 637
num_cols / days_in_week : 4991 / 7 = 713
This serves as an additional point in support of the hypothesis that the data represents weekly transactional activity. Based on other observations, this dataset does not seem to contain any aggregate features.
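The divisibility behind these figures is easy to verify in R (a trivial sketch; the predictor-column count is the 4,993 columns minus ID and target):

```r
# Both the row count and the predictor-column count divide evenly by 7,
# consistent with daily data grouped into weeks.
c(4459 %% 7, 4991 %% 7)   # both remainders are 0
c(4459 / 7, 4991 / 7)     # 637 and 713
```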
To prepare for modeling, we want to better understand the meaning of columns & rows. Further, we evaluate whether all the data is truly significant for our analysis. The key criterion is the number of zeros vs. the number of unique values.
tran.data.zero<-data.table(transaction.data)
n_zeros <- tran.data.zero[, lapply(.SD, function(x) sum(x == 0) / length(x))] %>% unlist
a <-list(
autotick = FALSE,
ticks = "outside",
tick0 = 0.6,
dtick = 0.1,
range = c(0.6, 1),
ticklen = 5,
tickwidth = 2,
tickcolor = toRGB("blue")
)
plot_ly(x = ~n_zeros, type = "histogram",
marker = list(color = "dodgerblue")) %>%
layout(xaxis = a, title = "Histogram of % of zeros in dataset", titlefont = list(color = '#000000', size = 20), margin = list(l = 50, t=40))
Source: Kaggle
As a start, we select the subset of the training data where columns have more than 800 non-zero values and rows have more than 650.
x<-colSums(transaction.data != 0)
y<-colnames(transaction.data)
x_name<-"Count"
y_name<-"Col_name"
Train_nz<- data.frame(x, y)
colnames(Train_nz) <- c(x_name, y_name)
#Include columns with non_zero values greater than 800
Subset1<-Train_nz[Train_nz$Count>800,]
Subset1$Col_name<-as.character(Subset1$Col_name)
#head(Subset1$Col_name)
#str(Subset1$Col_name)
train_non_zero<-transaction.data[Subset1$Col_name]
#head(train_non_zero,3)
w<-rowSums(transaction.data != 0)
t<-rownames(transaction.data)
w_name<-"Count"
t_name<-"Row_name"
Train_nz2<- data.frame(w, t)
colnames(Train_nz2) <- c(w_name, t_name)
#head(Train_nz2)
#Include rows with non_zero values greater than 650
Subset1a<-Train_nz2[Train_nz2$Count>650,]
Subset1a$Row_name<-as.character(Subset1a$Row_name)
#head(Subset1a$Row_name)
#str(Subset1a$Row_name)
train_non_zero<-train_non_zero[Subset1a$Row_name,]
head(train_non_zero,3)
## ID target X20aa07010 X963a49cdc X68a945b18 X935ca66a9
## 6 002dbeb22 2800000 17020000 17020000 17020000 17020000
## 78 03ec1414d 200000 350000 350000 300000 300000
## 115 06cd03d09 3095200 4005500 3420000 167000 1109500
## X8.61E.26 X26fc93eb7 X0572565c2 X66ace2992 fb49e4212 X6619d81fc
## 6 17020000 17020000 17020000 17020000 17020000.00 17020000
## 78 300000 350000 350000 350000 350000.00 350000
## 115 1100000 8260400 40000 0 16818666.66 3252500
## aca228668 X6eef030c1 bc70cbc26 fc99f9426 X7e814a30d bd6da0cca
## 6 8891333.34 0.00 8891333.34 17020000 8891333.34 17020000
## 78 685000.00 0.00 685000.00 350000 685000.00 300000
## 115 819500.00 1859333.34 1614909.10 22134000 236857.14 500000
## X1db387535 b43a7cfd5 X024c577b9 ea772e115 ad009c8b9 X2ec5b290f
## 6 17020000 17020000.00 17020000.00 17020000 17020000 17020000
## 78 350000 350000.00 350000.00 300000 300000 350000
## 115 2090000 282333.34 1554666.66 0 4240000 3012000
## X0ff32eb98 X166008929 X58e056e12 X241f0f867 cbb673163 X1931ccfdd
## 6 17020000 17020000 0 17020000 17020000 17020000
## 78 350000 300000 350000 350000 300000 350000
## 115 547000 140000 3862000 10491200 24200000 14145000
## f02ecb19c X58e2e02e6 X1fe5d56b9 X9fd594eec fb0f5dbfe X91f701ba2
## 6 17020000 0 17020000 17000000 0.0 17020000
## 78 300000 17000000 300000 4000000 350000.0 350000
## 115 600000 0 0 7279000 111791333.3 25100000
## ca2b906e8 X703885424 f97d9431e X939f628a7 X62fb56487 eeb9cd3aa
## 6 17020000 17020000.00 17020000 8891333.34 17020000 2800000
## 78 300000 350000.00 300000 685000.00 300000 5700000
## 115 6880000 1448888.88 25100000 371200.00 3420000 1060000
## X324921c7b X58232a6fb c0d2348b7 X491b9ee45 c8d582dd2 d6bb78916
## 6 17020000.00 17020000.00 17020000 17020000 17020000 17020000
## 78 350000.00 350000.00 300000 350000 300000 350000
## 115 11973333.34 10814666.66 2090000 450000 880000 42000
## c10f31664 X4bcf15776 X70feb1494 adb64ff71 X11e12dbe8 X9de83dc23
## 6 17020000 17020000 17020000 17020000 17020000 17020000
## 78 300000 300000 350000 350000 300000 300000
## 115 38800000 5200000 2190000 25300000 30000000 25300000
## X62e59a501 X15ace8c9f X5c6487af1 bb0ce54e9 f190486d6 f74e8f13d
## 6 17020000.00 556000 17020000.0 17020000 0 17020000
## 78 350000.00 350000 350000.0 300000 0 350000
## 115 5352666.66 972000 147090.9 4400000 3530500 2286500
## X5d3b81ef8 X77deffdf0 c5a231d81 e176a204a X1702b5bf0 a09a238d0
## 6 8891333.34 17020000 17020000 17020000 17020000.00 17020000
## 78 685000.00 300000 350000 350000 350000.00 300000
## 115 929600.00 26400000 11115200 30000000 3056888.88 0
## X190db8488 c47340d97 X8781e4b91 ef30f6be5 X23310aa6f
## 6 17020000 17020000.00 17020000 8891333.34 17020000
## 78 350000 350000.00 300000 685000.00 350000
## 115 819750 2253333.34 0 660000.00 23985000
write.csv(train_non_zero, file = "train_non_zero.csv",row.names=FALSE)
This approach allows us to identify ~70 variables and ~140 observations that appear to be the most impactful for the target variable. We also added a column with the mean value for each row.
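The per-row mean mentioned above can be derived directly from the predictor columns; a minimal sketch (the column name `row_mean` is our own choice):

```r
# Append a per-row mean across all predictor columns,
# excluding the ID and target columns.
predictor_cols <- setdiff(colnames(train_non_zero), c("ID", "target"))
train_non_zero$row_mean <- rowMeans(train_non_zero[, predictor_cols])
```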
Proceeding further, more advanced algorithms could be used to detect the patterns between columns and rows. For example, a mix of feature importance, sorting columns and rows by sum of non-zeros, and correlation plus RMSE between columns.
We could also consider Principal Component Analysis to further group the variables.
train2<-subset(train_non_zero,select=-c(target,ID))
pc<-prcomp(train2)
summary(pc)
## Importance of components:
## PC1 PC2 PC3
## Standard deviation 28452263.12503 13814069.95623 13658500.42837
## Proportion of Variance 0.20904 0.04928 0.04817
## Cumulative Proportion 0.20904 0.25831 0.30648
## PC4 PC5 PC6
## Standard deviation 13495921.79137 13305484.60045 12829748.54822
## Proportion of Variance 0.04703 0.04571 0.04250
## Cumulative Proportion 0.35351 0.39923 0.44173
## PC7 PC8 PC9
## Standard deviation 12653298.27604 12382544.67908 12003277.99093
## Proportion of Variance 0.04134 0.03959 0.03720
## Cumulative Proportion 0.48307 0.52267 0.55987
## PC10 PC11 PC12
## Standard deviation 11789582.14638 11236817.00212 11118053.71473
## Proportion of Variance 0.03589 0.03260 0.03192
## Cumulative Proportion 0.59576 0.62836 0.66028
## PC13 PC14 PC15
## Standard deviation 10837418.56124 10668063.29444 10382706.24548
## Proportion of Variance 0.03033 0.02939 0.02784
## Cumulative Proportion 0.69061 0.72000 0.74783
## PC16 PC17 PC18
## Standard deviation 9939811.05391 9696207.04405 8555770.81417
## Proportion of Variance 0.02551 0.02428 0.01890
## Cumulative Proportion 0.77335 0.79762 0.81652
## PC19 PC20 PC21
## Standard deviation 8114606.09850 7655532.37539 7177005.58347
## Proportion of Variance 0.01700 0.01513 0.01330
## Cumulative Proportion 0.83353 0.84866 0.86196
## PC22 PC23 PC24
## Standard deviation 7092703.81095 6836254.75690 6436286.08787
## Proportion of Variance 0.01299 0.01207 0.01070
## Cumulative Proportion 0.87495 0.88702 0.89772
## PC25 PC26 PC27
## Standard deviation 6066468.41110 5916894.32180 5844677.62758
## Proportion of Variance 0.00950 0.00904 0.00882
## Cumulative Proportion 0.90722 0.91626 0.92508
## PC28 PC29 PC30
## Standard deviation 5658296.64025 5366417.45323 4979286.03273
## Proportion of Variance 0.00827 0.00744 0.00640
## Cumulative Proportion 0.93335 0.94078 0.94718
## PC31 PC32 PC33
## Standard deviation 4270083.81550 4130348.39512 4000862.76699
## Proportion of Variance 0.00471 0.00441 0.00413
## Cumulative Proportion 0.95189 0.95630 0.96043
## PC34 PC35 PC36
## Standard deviation 3889047.66908 3763794.43003 3556414.95474
## Proportion of Variance 0.00391 0.00366 0.00327
## Cumulative Proportion 0.96434 0.96799 0.97126
## PC37 PC38 PC39
## Standard deviation 3487221.74456 3393276.63263 3109433.79942
## Proportion of Variance 0.00314 0.00297 0.00250
## Cumulative Proportion 0.97440 0.97737 0.97987
## PC40 PC41 PC42
## Standard deviation 2971497.79532 2828930.86913 2685540.31369
## Proportion of Variance 0.00228 0.00207 0.00186
## Cumulative Proportion 0.98215 0.98422 0.98608
## PC43 PC44 PC45
## Standard deviation 2523885.28039 2349239.93486 2293042.40406
## Proportion of Variance 0.00164 0.00143 0.00136
## Cumulative Proportion 0.98772 0.98915 0.99051
## PC46 PC47 PC48
## Standard deviation 2183298.96044 2054172.21424 1982366.01076
## Proportion of Variance 0.00123 0.00109 0.00101
## Cumulative Proportion 0.99174 0.99283 0.99384
## PC49 PC50 PC51
## Standard deviation 1852424.03085 1808851.76826 1647618.96710
## Proportion of Variance 0.00089 0.00084 0.00070
## Cumulative Proportion 0.99473 0.99557 0.99627
## PC52 PC53 PC54
## Standard deviation 1565368.66280 1463802.14245 1417607.76641
## Proportion of Variance 0.00063 0.00055 0.00052
## Cumulative Proportion 0.99691 0.99746 0.99798
## PC55 PC56 PC57
## Standard deviation 1259265.90789 1202521.97104 1055083.47765
## Proportion of Variance 0.00041 0.00037 0.00029
## Cumulative Proportion 0.99839 0.99876 0.99905
## PC58 PC59 PC60 PC61
## Standard deviation 910067.00528 861731.07000 766484.55635 611359.0109
## Proportion of Variance 0.00021 0.00019 0.00015 0.0001
## Cumulative Proportion 0.99926 0.99945 0.99961 0.9997
## PC62 PC63 PC64 PC65
## Standard deviation 516826.32229 468050.48849 456246.29974 432446.59388
## Proportion of Variance 0.00007 0.00006 0.00005 0.00005
## Cumulative Proportion 0.99977 0.99983 0.99988 0.99993
## PC66 PC67 PC68 PC69
## Standard deviation 345668.88231 259402.47042 224748.79172 178658.51501
## Proportion of Variance 0.00003 0.00002 0.00001 0.00001
## Cumulative Proportion 0.99996 0.99998 0.99999 1.00000
#plot(pc)
plot(pc,type="l")
set_plot_dimensions(1200, 1200)
biplot(pc, cex.lab=0.8, cex.axis=0.8, cex.main=0.7, cex.sub=0.5)
#attributes(pc)
The principal component analysis identified 69 components, which is close to the total number of variables in our latest subset, so dimensionality reduction offers little benefit here. Therefore, we proceed with the train_non_zero subset.
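Had we wanted to keep a reduced set of components instead, the usual cut-off rule can be sketched directly from the prcomp object (the 90% threshold is an assumption on our part, not a requirement of the project):

```r
# Retain the smallest number of components whose cumulative
# proportion of explained variance reaches 90%.
cum_var <- cumsum(pc$sdev^2) / sum(pc$sdev^2)
n_comp  <- which(cum_var >= 0.90)[1]
n_comp  # 25, matching the cumulative proportions printed above
```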
Once the desired subset is selected, we analyze its structure and identify the necessary elements for the data preparation process.
str(train_non_zero, list.len = 10, vec.len = 5)
## 'data.frame': 141 obs. of 71 variables:
## $ ID : Factor w/ 4459 levels "000d6aaf2","000fbd867",..: 6 73 109 121 145 150 173 186 207 228 238 373 ...
## $ target : num 2800000 200000 3095200 10814667 200000 25100000 ...
## $ X20aa07010: num 17020000 350000 4005500 3420000 747474 600000 ...
## $ X963a49cdc: num 17020000 350000 3420000 25300000 747474 35050000 ...
## $ X68a945b18: num 17020000 300000 167000 1109500 1330000 1800000 ...
## $ X935ca66a9: num 17020000 300000 1109500 1800000 1330000 0 ...
## $ X8.61E.26 : num 17020000 300000 1100000 4900000 1330000 715000 ...
## $ X26fc93eb7: num 17020000 350000 8260400 2253333 747474 2186000 ...
## $ X0572565c2: num 17020000 350000 40000 2300000 747474 0 ...
## $ X66ace2992: num 17020000 350000 0 25100000 747474 430000 ...
## [list output truncated]
summary.subset <- summary.data.frame(train_non_zero)
summary.subset[1:6, 1:10]
## ID target X20aa07010 X963a49cdc
## 002dbeb22: 1 Min. : 30000 Min. : 0 Min. : 0
## 03ec1414d: 1 1st Qu.: 380000 1st Qu.: 307000 1st Qu.: 338000
## 06cd03d09: 1 Median : 1359000 Median : 834800 Median : 814800
## 077787b5d: 1 Mean : 4380924 Mean : 4276344 Mean : 3421982
## 08c1f91b0: 1 3rd Qu.: 5187333 3rd Qu.: 3252500 3rd Qu.: 2186000
## 08f544828: 1 Max. :35050000 Max. :111791333 Max. :35050000
## X68a945b18 X935ca66a9 X8.61E.26
## Min. : 0 Min. : 0.0 Min. : 0
## 1st Qu.: 65000 1st Qu.: 140000.0 1st Qu.: 92000
## Median : 444000 Median : 747333.3 Median : 732000
## Mean : 2992850 Mean : 2827357.4 Mean : 2533558
## 3rd Qu.: 1775333 3rd Qu.: 1775333.3 3rd Qu.: 1775333
## Max. :38800000 Max. :30000000.0 Max. :30000000
## X26fc93eb7 X0572565c2 X66ace2992
## Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 338000 1st Qu.: 350000 1st Qu.: 338000
## Median : 819750 Median : 834800 Median : 819750
## Mean : 3410870 Mean : 2996867 Mean : 3518926
## 3rd Qu.: 1933333 3rd Qu.: 1563412 3rd Qu.: 2253333
## Max. :35050000 Max. :35050000 Max. :35050000
plot_histogram(train_non_zero)
plot_correlation(train_non_zero,type="continuous", theme_grey(base_size=5))
Note: the provided test set does not contain the target variable, so our evaluation of the models will have to be based on the training set. However, we still apply all the necessary data preparation steps to the test set to showcase the process.
First, we create a similar subset from the test data, using exactly the same columns and similar non-zero rows.
#Compare test and train base files
comparison <- compare(transaction.data,test_non_zero_base,allowAll=TRUE)
comparison$tM
semi_join(transaction.data,test_non_zero_base)
It seems like there aren’t any common rows.
subset_colnames<-colnames(train_non_zero)
subset_ID<-as.character(train_non_zero$ID)
test_names<-names(test_non_zero_base)[names(test_non_zero_base) %in% subset_colnames]
test_ID<-test_non_zero_base$ID[test_non_zero_base$ID %in% subset_ID]
test_non_zero <-test_non_zero_base[, test_names]
z<-rowSums(test_non_zero_base != 0)
q<-rownames(test_non_zero_base)
z_name<-"Count"
q_name<-"Row_name"
Train_nz3<- data.frame(z, q)
colnames(Train_nz3) <- c(z_name, q_name)
#head(Train_nz3)
#Include rows with non_zero values greater than 950
Subset1b<-Train_nz3[Train_nz3$Count>950,]
Subset1b$Row_name<-as.character(Subset1b$Row_name)
#head(Subset1b$Row_name)
#str(Subset1b$Row_name)
test_row_names<-rownames(test_non_zero)[rownames(test_non_zero) %in% Subset1b$Row_name]
test_non_zero<-test_non_zero[test_row_names, ]
head(test_non_zero,3)
## ID X20aa07010 X963a49cdc X68a945b18 X935ca66a9 X26fc93eb7
## 874 046868511 0 0.00 97000 1180000 136800.00
## 1369 07061afcb 8120000 20000000.00 1000000 2276500 340666.66
## 1584 08337f345 380000 110666.66 600000 68000 186000.00
## X0572565c2 X66ace2992 fb49e4212 X6619d81fc aca228668 X6eef030c1
## 874 5510800 59333.34 78000.00 6060000.00 161619.04 186000
## 1369 186000 380000.00 8478666.66 3840000.00 151500.00 5345600
## 1584 656400 3840000.00 1137555.56 1427230.76 200000.00 4139600
## bc70cbc26 fc99f9426 X7e814a30d bd6da0cca X1db387535 b43a7cfd5
## 874 918000 4609500.00 205600.00 1248000 39000.00 1427230.76
## 1369 0 211333.34 458000.00 116000 8583428.58 1372800.00
## 1584 450000 237000.00 519333.34 0 340000.00 20000000.00
## X024c577b9 ea772e115 ad009c8b9 X2ec5b290f X0ff32eb98 X166008929
## 874 221000 0 15800000 190800 1347166.66 7252000
## 1369 11700000 0 281000 2290800 221000.00 660000
## 1584 564800 4900000 11187200 1020500 15800000.00 15800000
## X58e056e12 X241f0f867 cbb673163 X1931ccfdd f02ecb19c X58e2e02e6
## 874 115500.00 1200000.00 0 0.00 0 8583428.58
## 1369 0.00 9731666.66 800000 664000.00 0 2340400.00
## 1584 8478666.66 970500.00 2276500 1124166.66 0 0.00
## X1fe5d56b9 X9fd594eec fb0f5dbfe X91f701ba2 ca2b906e8 X703885424
## 874 14000 110666.66 12686800.00 12307000.00 1720000 1000000
## 1369 25400000 369200.00 2600000.00 110666.66 111000 6206000
## 1584 2560000 6062500.00 9731666.66 383200.00 660000 12686800
## f97d9431e X939f628a7 X62fb56487 eeb9cd3aa X324921c7b X58232a6fb
## 874 1280000 34666.66 0 211333.34 383200 340000.00
## 1369 80000 545333.34 0 5600000.00 6062500 0.00
## 1584 483000 132500.00 80000 52932000.00 6206000 340666.66
## c0d2348b7 X491b9ee45 c8d582dd2 d6bb78916 c10f31664 X4bcf15776
## 874 39000 1554666.66 0 200000 268000 68000.00
## 1369 11187200 1020500.00 15010000 22529000 0 664666.66
## 1584 0 0.00 9307200 3761200 0 116000.00
## X70feb1494 adb64ff71 X11e12dbe8 X9de83dc23 X62e59a501 X15ace8c9f
## 874 500000.00 911000 0 1424000 656400.00 1124166.66
## 1369 525066.66 12686800 2560000 9307200 4139600.00 3677000.00
## 1584 115500.00 1200000 200000 1200000 525066.66 2290800.00
## X5c6487af1 bb0ce54e9 f190486d6 f74e8f13d X5d3b81ef8 X77deffdf0
## 874 2456800.00 0.00 3840000.00 15800000.00 200000 9106000
## 1369 3761200.00 4900000.00 6266666.66 564800.00 634000 600000
## 1584 211333.34 84666.66 1372800.00 8583428.58 10000 0
## c5a231d81 e176a204a X1702b5bf0 a09a238d0 X190db8488 c47340d97
## 874 16000.00 3540000.00 237000 13100000 984000.00 0
## 1369 0.00 1137555.56 52932000 0 1124166.66 115500
## 1584 59333.34 200000.00 664000 1100000 190800.00 78000
## X8781e4b91 ef30f6be5 X23310aa6f
## 874 800000.00 6000.00 1180000
## 1369 1650666.66 950333.34 970500
## 1584 290000.00 600000.00 221000
write.csv(test_non_zero, file = "test_non_zero.csv",row.names=FALSE)
Interestingly, one column name from train_non_zero was not found in the initial test file. We would have to go back to the client to learn the reason: was it a typo, were some variables omitted on purpose, or something else?
#Compare test and train subsets
comparison <- compare(train_non_zero,test_non_zero,allowAll=TRUE)
comparison$tM
semi_join(train_non_zero,test_non_zero)
Next, we conduct several two-sample t-tests for a difference in means. The null hypothesis is that the means are equal and the partition is valid. The alternative hypothesis is that the means are significantly different and the partition is invalid. We assume a significance level of 5%.
mean1<-mean(train_non_zero[,3])
mean2<-mean(test_non_zero[,2])
sd1<-sd(train_non_zero[,3])
sd2<-sd(test_non_zero[,2])
l1<-length(train_non_zero[,3])
l2<-length(test_non_zero[,2])
dfs <- min(l1 - 1, l2 - 1)
tdata <- (mean1 - mean2) / sqrt((sd1^2/l1)+(sd2^2/l2))
pvalue <- 2*pt(abs(tdata), df = dfs, lower.tail=FALSE)
tdata; pvalue
## [1] -1.704320307
## [1] 0.091660198
Based on the test for the first predictor column, the two-sided p-value is higher than 0.05, so we do not have enough evidence to reject the null hypothesis, and the partition appears valid. (Note that the absolute value of the test statistic must be used here; without it, the two-tailed formula can return a value above 1, which is not a valid p-value.)
mean3<-mean(train_non_zero[,4])
mean4<-mean(test_non_zero[,3])
sd3<-sd(train_non_zero[,4])
sd4<-sd(test_non_zero[,3])
l3<-length(train_non_zero[,4])
l4<-length(test_non_zero[,3])
dfs <- min(l3 - 1, l4 - 1)
tdata1 <- (mean3 - mean4) / sqrt((sd3^2/l3)+(sd4^2/l4))
pvalue1 <- 2*pt(abs(tdata1), df = dfs, lower.tail=FALSE)
tdata1; pvalue1
## [1] -2.157003084
## [1] 0.033583623
For the second variable the p-value falls slightly below 0.05, so its means may differ between the two sets. Given the small sample sizes and the exploratory nature of this check, we still treat the partition as acceptable for the goals of the modelling, while noting this caveat.
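As a cross-check, the same comparisons can be run with R's built-in Welch two-sample t-test, which estimates the degrees of freedom from the data rather than taking the minimum of the two sample sizes (a sketch, not part of the original workflow):

```r
# Welch's t-test does not assume equal variances and
# reports the two-sided p-value directly.
t.test(train_non_zero[, 3], test_non_zero[, 2])
t.test(train_non_zero[, 4], test_non_zero[, 3])
```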
To start off, we standardize both data sets using the z-score method.
train_non_zero_scaled<-scale(train_non_zero[,-1])
train_non_zero_scaled<-data.frame(train_non_zero_scaled)
train_non_zero_scaled$ID<-train_non_zero$ID
train_non_zero_scaled<-train_non_zero_scaled[c(71,1:70)]
train_non_zero_scaled$target<-train_non_zero$target
#Plot correlation
plot_correlation(train_non_zero_scaled,type="continuous")
pairs(train_non_zero_scaled[2:10])
test_non_zero_scaled<-scale(test_non_zero[,-1])
test_non_zero_scaled<-data.frame(test_non_zero_scaled)
test_non_zero_scaled$ID<-test_non_zero$ID
test_non_zero_scaled<-test_non_zero_scaled[c(69,1:68)]
#Plot correlation
#plot_correlation(test_non_zero_scaled,type="continuous")
#pairs(test_non_zero_scaled[2:10])
Next, we analyze and, if necessary, remove outliers.
outliers <- function(dataframe){
dataframe %>%
select_if(is.numeric) %>%
map(~ boxplot.stats(.x)$out)
}
head(outliers(train_non_zero_scaled), 5)
## $target
## [1] 25100000.00 23985000.00 17000000.00 12666666.66 16818666.66
## [6] 30000000.00 35050000.00 13263000.00 20000000.00 14145000.00
## [11] 17000000.00 14000000.00 16733333.34 22134000.00 25300000.00
## [16] 17000000.00
##
## $X20aa07010
## [1] 1.1573695130 0.8962641201 0.6210999047 0.6990349457 1.1573695130
## [6] 1.1573695130 1.7899257083 2.7948408987 1.1555531277 1.1313346570
## [11] 0.3490780359 1.8911891911 0.5938056870 1.6218192449 9.7644325446
## [16] 0.5644286820 1.9093530445 0.3618290610 1.1390845666 1.1573695130
## [21] 0.7014567922
##
## $X963a49cdc
## [1] 2.0828594576 3.3511381057 4.8445821659 1.1783990999 3.1497146247
## [6] 2.0828594576 2.0828594576 3.3205033557 0.3107157596 2.0828594576
## [11] 0.7411186791 0.7196130846 1.3139272337 0.2957302601 1.0828186801
## [16] 4.0710547296 2.0389496503 2.8661900139 1.3098426014 2.0520204749
## [21] 2.0828594576 1.6424849270
##
## $X68a945b18
## [1] 2.2454253992 0.3052909793 0.9776147882 2.2454253992 2.2454253992
## [6] 3.7469485724 2.2454253992 0.5665368022 0.2252524306 3.5388483458
## [11] 2.3862932449 0.3533141085 0.6222436320 1.3988576698 3.2987326998
## [16] 0.3110537548 0.2252524306 3.5708637653 1.0576533368 1.7129022564
## [21] 2.2454253992 0.7531066591 5.7319045796
##
## $X935ca66a9
## [1] 2.3969579776 0.3500431354 3.9811214445 1.0593700609 2.3969579776
## [6] 2.3969579776 2.3969579776 3.7953453450 0.8225055340 0.6256673122
## [11] 3.6095692455 0.6844401146 0.2655994538 3.5082368275 4.5891159521
## [16] 3.7615678724 1.1438137425 0.2655994538 0.3561230805 2.3969579776
Looking at the identified outliers, we do not yet possess enough domain and client knowledge to justify removing them. Therefore, we will proceed to modelling with the full standardized data set and, if necessary, make iterative adjustments based on the learnings of the next phase.
Finally, we transform the target variable to improve its normality for modelling.
summary(train_non_zero_scaled$target)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 30000 380000 1359000 4380924 5187333 35050000
train_non_zero_scaled$log_target<-log(train_non_zero_scaled$target)
train_non_zero_scaled<-train_non_zero_scaled[c(1:2,72,3:71)]
skew(train_non_zero_scaled$log_target)
## [1] -0.04391634008
ggplot(train_non_zero_scaled,aes(x=log_target))+geom_histogram(fill="blue")+ggtitle("Histogram of Normalized Target")
qqnorm(train_non_zero_scaled$log_target,
datax = TRUE,
col = "red",
main = "Normal Q-Q Plot of log_target Distribution")
qqline(train_non_zero_scaled$log_target,
col = "blue",
datax = TRUE)
plot(rowSums(train_non_zero_scaled[,4:43]), train_non_zero_scaled$log_target, main="Scatterplot of log_target vs row sums", xlab="Row Sums", ylab="Log of Target", pch=16)
Based on the learnings from modelling, we might also consider normalizing predictor variables. However, we need to keep in mind the negative impact it would have on the ability to interpret the results of, for example, linear regression. Using base dollar values makes it much easier to understand the nature of the effect that predictors have on the target.
For some of the numeric predictors, we see an opportunity to apply equal-frequency binning.
#Binning one of the predictors
N1 <- length(train_non_zero_scaled[,4])
nbins1 <- 5
whichbin1 <- c(rep(0, N1))
freq1 <- N1/nbins1
train_non_zero_scaled <- train_non_zero_scaled[order(train_non_zero_scaled[,4]),]
for (i in 1:nbins1) {
for (j in 1:N1) {
if((i-1)*freq1 < j && j <=i*freq1)
whichbin1[j] <- i
}
}
whichbin1<-gsub(pattern = "1", replacement = "VLow", whichbin1)
whichbin1<-gsub(pattern = "2", replacement = "Low", whichbin1)
whichbin1<-gsub(pattern = "3", replacement = "Medium", whichbin1)
whichbin1<-gsub(pattern = "4", replacement = "High", whichbin1)
whichbin1<-gsub(pattern = "5", replacement = "VHigh", whichbin1)
train_non_zero_scaled[,4]<-whichbin1
#plot frequencies in the bins
barplot(table(train_non_zero_scaled[,4]))
N2 <- length(test_non_zero_scaled[,2])
nbins2 <- 5
whichbin2 <- c(rep(0, N2))
freq2 <- N2/nbins2
test_non_zero_scaled <- test_non_zero_scaled[order(test_non_zero_scaled[,2]),]
for (i in 1:nbins2) {
for (j in 1:N2) {
if((i-1)*freq2 < j && j <=i*freq2)
whichbin2[j] <- i
}
}
whichbin2<-gsub(pattern = "1", replacement = "VLow", whichbin2)
whichbin2<-gsub(pattern = "2", replacement = "Low", whichbin2)
whichbin2<-gsub(pattern = "3", replacement = "Medium", whichbin2)
whichbin2<-gsub(pattern = "4", replacement = "High", whichbin2)
whichbin2<-gsub(pattern = "5", replacement = "VHigh", whichbin2)
test_non_zero_scaled[,2]<-whichbin2
#plot frequencies in the bins
barplot(table(test_non_zero_scaled[,2]))
write.csv(train_non_zero_scaled, file = "train_non_zero_scaled.csv",row.names=FALSE)
If necessary, we might want to consider binning based on predictive value or clarifying the best cut-off numbers with the client experts.
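For reference, the loop-based equal-frequency binning performed above can be expressed more concisely with dplyr's `ntile()`; this is an equivalent sketch of the same five-bin step, shown for comparison rather than to be re-run here:

```r
# ntile() ranks the values and splits them into 5 near-equal groups,
# so no explicit sort of the data frame is needed.
library(dplyr)
bin_labels <- c("VLow", "Low", "Medium", "High", "VHigh")
train_non_zero_scaled[, 4] <- bin_labels[ntile(train_non_zero_scaled[, 4], 5)]
```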
Also, as we plan to apply a classification decision tree, we have decided to try equal-frequency binning for the target variable as well.
#Binning Target Variable
Nt <- length(train_non_zero_scaled$log_target)
nbinst <- 5
whichbint <- c(rep(0, Nt))
freqt <- Nt/nbinst
train_non_zero_scaled_sorted <- train_non_zero_scaled[order(train_non_zero_scaled$log_target),]
for (i in 1:nbinst) {
for (j in 1:Nt) {
if((i-1)*freqt < j && j <=i*freqt)
whichbint[j] <- i
}
}
whichbint1<-gsub(pattern = "1", replacement = "VLow", whichbint)
whichbint2<-gsub(pattern = "2", replacement = "Low", whichbint1)
whichbint3<-gsub(pattern = "3", replacement = "Medium", whichbint2)
whichbint4<-gsub(pattern = "4", replacement = "High", whichbint3)
whichbint5<-gsub(pattern = "5", replacement = "VHigh", whichbint4)
train_non_zero_scaled_sorted$bin_target<-whichbint5
train_non_zero_scaled_sorted<-train_non_zero_scaled_sorted[c(1:3,73,4:72)]
#plot frequencies in the bins
barplot(table(train_non_zero_scaled_sorted$bin_target))
We start with unsupervised clustering to see if we can identify any patterns of similarity across data rows. We apply hierarchical agglomerative clustering using default complete linkage.
##calculate distance matrix (default is Euclidean distance)
distance = dist(train_non_zero_scaled[,5:72])
# Hierarchical agglomerative clustering using default complete linkage
train.hclust = hclust(distance)
#set_plot_dimensions(1600, 1200)
plot(train.hclust, cex=0.6)
member = cutree(train.hclust,3)
table(member)
## member
## 1 2 3
## 128 8 5
##calculate the same for the test subset
distance2 = dist(test_non_zero_scaled[,3:69])
train.hclust2 = hclust(distance2)
member2 = cutree(train.hclust2,3)
table(member2)
## member2
## 1 2 3
## 89 3 2
Clustering analysis on the entire standardized training subset did not yield any significant results; over 90% of the records fall into a single cluster (128 of 141 in the training subset and 89 of 94 in the test subset). Therefore, to maintain consistency in evaluating the models, we will proceed with treating each test subset record individually as well.
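One way to quantify how weak the cluster structure is would be the average silhouette width, where values near 0 mean points sit on cluster boundaries rather than in well-separated groups. A sketch on synthetic structureless data (the same check could be run on the `cutree` membership above; the variable names here are illustrative):

```r
library(cluster)
# Average silhouette width as a cluster-quality check on random 2-D data;
# structureless data yields a low value, mirroring what we saw above.
set.seed(1)
x <- matrix(rnorm(200), ncol = 2)
d <- dist(x)
memb <- cutree(hclust(d), k = 3)
mean(silhouette(memb, d)[, "sil_width"])
```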
Next, we proceed to supervised methods to achieve the goal of predicting future customer transactions.
Given the disguised nature of the columns, we start by applying stepwise linear regression in both directions on the whole standardized training subset with the log-transformed target variable.
#*use train_non_zero_scaled on log_target*
full.model<-lm(log_target ~., data = train_non_zero_scaled[,3:72])
step.model <- stepAIC(full.model, direction = "both", trace = FALSE)
summary(step.model)
##
## Call:
## lm(formula = log_target ~ X963a49cdc + X68a945b18 + X935ca66a9 +
## X8.61E.26 + X26fc93eb7 + X66ace2992 + fb49e4212 + X6619d81fc +
## X1db387535 + ea772e115 + X2ec5b290f + X58e056e12 + X58e2e02e6 +
## eeb9cd3aa + X491b9ee45 + c10f31664 + X4bcf15776 + adb64ff71 +
## X11e12dbe8 + X62e59a501 + X15ace8c9f + X5c6487af1 + bb0ce54e9 +
## X5d3b81ef8 + X8781e4b91 + fc99f9426 + f97d9431e, data = train_non_zero_scaled[,
## 3:72])
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9690287 -0.6342392 -0.1149387 0.6987682 3.1207249
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.0764960 0.1089592 129.19055 < 0.000000000000000222 ***
## X963a49cdc 0.4766312 0.1666008 2.86092 0.00503360 **
## X68a945b18 -0.5251490 0.1996859 -2.62987 0.00973174 **
## X935ca66a9 -0.4812859 0.1660843 -2.89784 0.00451396 **
## X8.61E.26 -1.0638425 0.2233211 -4.76374 0.000005689103 ***
## X26fc93eb7 0.9323432 0.2246230 4.15070 0.000064527305 ***
## X66ace2992 1.1117268 0.2029142 5.47880 0.000000262849 ***
## fb49e4212 0.4686335 0.1552144 3.01927 0.00313281 **
## X6619d81fc 0.4895387 0.2255512 2.17041 0.03206848 *
## X1db387535 0.2922852 0.1676668 1.74325 0.08400983 .
## ea772e115 -0.8026186 0.2147010 -3.73831 0.00029241 ***
## X2ec5b290f 0.5282731 0.1865555 2.83172 0.00548272 **
## X58e056e12 0.5766069 0.1568835 3.67538 0.00036482 ***
## X58e2e02e6 -0.3287354 0.1403697 -2.34193 0.02093592 *
## eeb9cd3aa -0.2783087 0.1537363 -1.81030 0.07290679 .
## X491b9ee45 0.6976034 0.1705011 4.09149 0.000080672794 ***
## c10f31664 -1.4584060 0.2393897 -6.09218 0.000000015779 ***
## X4bcf15776 -0.2945194 0.2009240 -1.46582 0.14547315
## adb64ff71 0.6450074 0.2364863 2.72746 0.00740176 **
## X11e12dbe8 0.9890311 0.1798931 5.49788 0.000000241376 ***
## X62e59a501 -0.3518579 0.1589726 -2.21332 0.02888439 *
## X15ace8c9f 0.2469232 0.1477093 1.67168 0.09735379 .
## X5c6487af1 -0.3712544 0.1719917 -2.15856 0.03299972 *
## bb0ce54e9 -0.8458740 0.2385202 -3.54634 0.00056971 ***
## X5d3b81ef8 0.5875417 0.1119477 5.24836 0.000000726627 ***
## X8781e4b91 -0.5330379 0.1995699 -2.67093 0.00868073 **
## fc99f9426 0.5712708 0.1926983 2.96459 0.00369769 **
## f97d9431e -0.2651615 0.1902237 -1.39395 0.16606916
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.293818 on 113 degrees of freedom
## Multiple R-squared: 0.5489271, Adjusted R-squared: 0.4411486
## F-statistic: 5.093105 on 27 and 113 DF, p-value: 0.0000000003662682
The final model is significant as a whole, but it explains only about 55% of the variation in the data (and the adjusted R-squared drops to 44%).
#Evaluate using the same training set
train_non_zero_scaled$pred_target_value_reg<-exp(predict(step.model,train_non_zero_scaled))
(rmse1a<-RMSE(train_non_zero_scaled$target, train_non_zero_scaled$pred_target_value_reg))
## [1] 5472996.278
(rmsle1a<-rmsle(preds = as.numeric(train_non_zero_scaled$pred_target_value_reg), actuals = as.numeric(train_non_zero_scaled$target)))
## [1] 1.158248888
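For reference, RMSLE is the root mean squared error of log(1 + x)-transformed predictions and actuals, so it penalizes relative rather than absolute errors, which suits dollar amounts spanning several orders of magnitude. A minimal sketch with made-up values (`rmsle_manual` is an illustrative helper, not part of the project code):

```r
# RMSLE written out explicitly: sqrt of the mean squared difference of
# log1p-transformed predictions and actuals.
rmsle_manual <- function(actual, pred) {
  sqrt(mean((log1p(pred) - log1p(actual))^2))
}
rmsle_manual(c(100, 2000, 30000), c(150, 1500, 45000))
```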
par(mfrow = c(2, 2))
plot(step.model)
#test_non_zero_scaled$pred_target_value_reg<-exp(predict(step.model,test_non_zero_scaled))
#*use train_non_zero_scaled on log_target*
tree1<-rpart(log_target ~ ., data=train_non_zero_scaled[,3:72], method="anova", model=TRUE)
#summary(tree1)
tmp<-printcp(tree1)
##
## Regression tree:
## rpart(formula = log_target ~ ., data = train_non_zero_scaled[,
## 3:72], method = "anova", model = TRUE)
##
## Variables actually used in tree construction:
## [1] bb0ce54e9 c0d2348b7 cbb673163 ef30f6be5 fb49e4212 X166008929
## [7] X1702b5bf0 X2ec5b290f X58e2e02e6 X6619d81fc X6eef030c1
##
## Root node error: 419.35171/141 = 2.9741256
##
## n= 141
##
## CP nsplit rel error xerror xstd
## 1 0.181565504 0 1.00000000 1.00383681 0.087798761
## 2 0.143087328 1 0.81843450 1.02562438 0.089021190
## 3 0.077285045 2 0.67534717 0.92000939 0.104195728
## 4 0.058790530 3 0.59806212 0.92807564 0.108825860
## 5 0.037491574 4 0.53927159 0.95356066 0.112960416
## 6 0.036841659 5 0.50178002 0.94590377 0.115178626
## 7 0.032574586 7 0.42809670 0.92121313 0.112424997
## 8 0.021829320 8 0.39552212 0.93655612 0.112242351
## 9 0.018115251 9 0.37369280 0.95756615 0.118192716
## 10 0.014100675 10 0.35557754 0.98509559 0.122510382
## 11 0.010152555 11 0.34147687 1.02150248 0.124023075
## 12 0.010000000 12 0.33132431 1.01876079 0.124058802
(rsq.val <- 1-tmp[,3])
## 1 2 3 4 5
## 0.0000000000 0.1815655038 0.3246528321 0.4019378770 0.4607284070
## 6 7 8 9 10
## 0.4982199805 0.5719032987 0.6044778847 0.6263072043 0.6444224554
## 11 12
## 0.6585231306 0.6686756852
set_plot_dimensions(1200, 1200)
#plot(tree1, uniform=TRUE, main="Regression Tree for Continuous Normal Target ", margin=0.05)
#text(tree1, use.n=TRUE, all=TRUE, cex=.6)
prp(tree1)
After the twelfth split, this regression decision tree explains around 67% of the variation in the training data; note, however, that the cross-validated error (xerror) in the cp table stays close to 1.0, suggesting much weaker out-of-sample performance.
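A standard follow-up for rpart models is to prune at the complexity parameter with the lowest cross-validated error. A sketch on a dataset that ships with rpart (the same two lines starting from the cp table would apply to `tree1` above):

```r
library(rpart)
# Fit a small regression tree, then prune at the CP value whose row in the
# cp table has the minimum cross-validated error (xerror).
fit <- rpart(Mileage ~ ., data = cu.summary[!is.na(cu.summary$Mileage), ],
             method = "anova")
cp_tab <- fit$cptable
best_cp <- cp_tab[which.min(cp_tab[, "xerror"]), "CP"]
pruned <- prune(fit, cp = best_cp)
```

Pruning this way trades a little training-set fit for better expected performance on unseen data.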
#*use train_non_zero_scaled_sorted on bin_target*
tree2<-rpart(bin_target ~ ., data=train_non_zero_scaled_sorted[,4:73], method="class",model=TRUE)
#summary(tree2)
printcp(tree2)
##
## Classification tree:
## rpart(formula = bin_target ~ ., data = train_non_zero_scaled_sorted[,
## 4:73], method = "class", model = TRUE)
##
## Variables actually used in tree construction:
## [1] adb64ff71 cbb673163 ef30f6be5 fb49e4212 X190db8488 X2ec5b290f
## [7] X5d3b81ef8
##
## Root node error: 112/141 = 0.79432624
##
## n= 141
##
## CP nsplit rel error xerror xstd
## 1 0.107142857 0 1.00000000 1.16964286 0.027214983
## 2 0.098214286 1 0.89285714 1.13392857 0.031705770
## 3 0.071428571 2 0.79464286 1.03571429 0.040492204
## 4 0.062500000 3 0.72321429 0.94642857 0.045799356
## 5 0.040178571 4 0.66071429 0.87500000 0.048811223
## 6 0.026785714 6 0.58035714 0.83928571 0.049978737
## 7 0.017857143 8 0.52678571 0.86607143 0.049122971
## 8 0.010000000 9 0.50892857 0.86607143 0.049122971
#plot(tree2, uniform=TRUE, main="Classification Tree for Categorical Target ", margin=0.05)
#text(tree2, use.n=TRUE, all=TRUE, cex=.65)
prp(tree2)
#evaluating the first tree model (continuous) using the same training set
train_non_zero_scaled$pred_target_value_tree1<-exp(predict(tree1, train_non_zero_scaled))
(rmse2a<-RMSE(train_non_zero_scaled$target, train_non_zero_scaled$pred_target_value_tree1))
## [1] 5265257.901
(rmsle2a<-rmsle(preds = as.numeric(train_non_zero_scaled$pred_target_value_tree1), actuals = as.numeric(train_non_zero_scaled$target)))
## [1] 0.9926713606
#evaluating the second tree model (categorical) using the same training set
train_non_zero_scaled_sorted$pred_target_value_tree2<-predict(tree2, train_non_zero_scaled_sorted, type = "class")
confusionMatrix(train_non_zero_scaled_sorted$pred_target_value_tree2,as.factor(train_non_zero_scaled_sorted$bin_target), dnn = c("Prediction", "True Value"))
## Confusion Matrix and Statistics
##
## True Value
## Prediction High Low Medium VHigh VLow
## High 14 2 5 3 1
## Low 1 11 2 1 0
## Medium 4 4 16 2 1
## VHigh 6 4 1 21 4
## VLow 3 7 4 2 22
##
## Overall Statistics
##
## Accuracy : 0.5957447
## 95% CI : (0.5098909, 0.6774884)
## No Information Rate : 0.2056738
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.4944329
## Mcnemar's Test P-Value : 0.1429546
##
## Statistics by Class:
##
## Class: High Class: Low Class: Medium Class: VHigh
## Sensitivity 0.50000000 0.39285714 0.5714286 0.7241379
## Specificity 0.90265487 0.96460177 0.9026549 0.8660714
## Pos Pred Value 0.56000000 0.73333333 0.5925926 0.5833333
## Neg Pred Value 0.87931034 0.86507937 0.8947368 0.9238095
## Prevalence 0.19858156 0.19858156 0.1985816 0.2056738
## Detection Rate 0.09929078 0.07801418 0.1134752 0.1489362
## Detection Prevalence 0.17730496 0.10638298 0.1914894 0.2553191
## Balanced Accuracy 0.70132743 0.67872946 0.7370417 0.7951047
## Class: VLow
## Sensitivity 0.7857143
## Specificity 0.8584071
## Pos Pred Value 0.5789474
## Neg Pred Value 0.9417476
## Prevalence 0.1985816
## Detection Rate 0.1560284
## Detection Prevalence 0.2695035
## Balanced Accuracy 0.8220607
#test_non_zero_scaled$pred_target_value_tree<-exp(predict(tree1,test_non_zero_scaled))
To generate association rules, we create a subset of train_non_zero that includes only some of the variables marked as significant during the regression analysis. After making every column in the subset categorical, we generate association rules for the variable “target” at 3.5% support and a 60% confidence level.
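The code that builds `train_non_zero_scaled_sorted_ruled` is not shown above. A sketch of how such a subset might be constructed (the helper name, column choices, and bin labels are illustrative assumptions; the actual run used labels such as "Above Medium"):

```r
library(arules)
# Hypothetical helper: keep a target column plus selected predictors, then
# discretize every predictor into equal-frequency categories so that
# apriori() can treat each column/value pair as an item.
make_rules_subset <- function(df, target_col, predictor_cols) {
  out <- df[, c(target_col, predictor_cols)]
  names(out)[1] <- "target"
  for (v in predictor_cols) {
    out[[v]] <- discretize(out[[v]], method = "frequency", breaks = 5,
                           labels = c("VLow", "Low", "Medium", "High", "VHigh"))
  }
  out$target <- as.factor(out$target)
  out
}
```

The resulting all-factor data frame can then be coerced with `as(..., "transactions")` as done below.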
#*use train_non_zero_scaled_sorted_ruled on target*
train_non_zero_scaled_sorted_ruled <- as(train_non_zero_scaled_sorted_ruled, "transactions")
target_rules <- apriori(data=train_non_zero_scaled_sorted_ruled, parameter=list (supp=0.035,conf = 0.6, minlen=3, maxlen=5), appearance = list (rhs=c("target=VLow", "target=Low", "target=Medium", "target=High", "target=VHigh")))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.6 0.1 1 none FALSE TRUE 5 0.035 3
## maxlen target ext
## 5 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 4
##
## set item appearances ...[5 item(s)] done [0.00s].
## set transactions ...[35 item(s), 141 transaction(s)] done [0.00s].
## sorting and recoding items ... [35 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [58 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
inspect(target_rules[1:10])
## lhs rhs support confidence lift count
## [1] {c10f31664=Above Medium,
## b43a7cfd5=Low} => {target=VLow} 0.04964539007 0.6363636364 3.204545455 7
## [2] {X20aa07010=Low,
## c10f31664=Above Medium} => {target=VLow} 0.03546099291 0.6250000000 3.147321429 5
## [3] {X20aa07010=VLow,
## c10f31664=Above Medium} => {target=VLow} 0.04255319149 0.6000000000 3.021428571 6
## [4] {X20aa07010=VLow,
## ea772e115=Medium} => {target=VLow} 0.04255319149 0.6666666667 3.357142857 6
## [5] {X26fc93eb7=Low,
## c10f31664=Above Medium} => {target=VLow} 0.05673758865 0.6666666667 3.357142857 8
## [6] {c10f31664=Above Medium,
## X024c577b9=VLow} => {target=VLow} 0.04964539007 0.7777777778 3.916666667 7
## [7] {X66ace2992=Low,
## c10f31664=Above Medium} => {target=VLow} 0.07092198582 0.6666666667 3.357142857 10
## [8] {c10f31664=Above Medium,
## ea772e115=Medium} => {target=VLow} 0.07801418440 0.6470588235 3.258403361 11
## [9] {X20aa07010=Medium,
## b43a7cfd5=Above Medium,
## X024c577b9=Medium} => {target=Low} 0.03546099291 0.7142857143 3.596938776 5
## [10] {X20aa07010=Low,
## c10f31664=Above Medium,
## b43a7cfd5=Low} => {target=VLow} 0.03546099291 0.7142857143 3.596938776 5
target_rules<-sort(target_rules, by="confidence", decreasing=TRUE)
set_plot_dimensions(1200, 1200)
plot(target_rules[1:10], measure = "support", method="graph", shading="confidence", cex=0.7)
The neural network model was built with the help of the Rattle package, using 15 nodes in its single hidden layer; as Rattle is GUI-driven, no code is shown for this step.
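Rattle builds its neural networks on top of the nnet package, which fits a single hidden layer; `size = 15` mirrors the 15 hidden nodes, and `linout = TRUE` requests a linear output unit for a continuous target. The parameters below are assumptions, shown on synthetic data; for the actual model, one would substitute `train_non_zero_scaled[, 3:72]`:

```r
library(nnet)
# Sketch of the nnet call Rattle issues under the hood (parameters assumed).
set.seed(12)
toy <- data.frame(matrix(rnorm(141 * 10), ncol = 10))
toy$log_target <- rnorm(141, mean = 14)
nn <- nnet(log_target ~ ., data = toy, size = 15, linout = TRUE,
           decay = 0.1, maxit = 200, trace = FALSE)
length(predict(nn, toy))  # one fitted value per row
```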
set.seed(12)
train_non_zero_scaled=train_non_zero_scaled %>% mutate_if(is.character, as.factor)
target.forest=randomForest(log_target~.,data=train_non_zero_scaled[,3:72],mtry=15,importance=TRUE,ntree=250)
target.forest
##
## Call:
## randomForest(formula = log_target ~ ., data = train_non_zero_scaled[, 3:72], mtry = 15, importance = TRUE, ntree = 250)
## Type of random forest: regression
## Number of trees: 250
## No. of variables tried at each split: 15
##
## Mean of squared residuals: 2.357481635
## % Var explained: 20.73
varImpPlot(target.forest, cex=0.5)
train_non_zero_scaled$pred_target_value_forest<-exp(predict(target.forest,train_non_zero_scaled))
(rmse3a<-RMSE(train_non_zero_scaled$target, train_non_zero_scaled$pred_target_value_forest))
## [1] 4683592.621
(rmsle3a<-rmsle(preds = as.numeric(train_non_zero_scaled$pred_target_value_forest), actuals = as.numeric(train_non_zero_scaled$target)))
## [1] 0.6434032396
To come up with recommendations for deployment, we first look at the statistical summary of model performance. Since the project objective is predicting the actual dollar value of the future transaction, we will focus on three supervised methods: linear regression vs regression decision tree vs random forest. We will use the insights from other models to enrich our proposal.
| | Linear Regression | Regression Decision Tree | Random Forest of Regression Trees |
|---|---|---|---|
| R Squared | 55% | 67% | 20% |
| RMSE on the same set | $5,472,996.28 | $5,265,257.90 | $4,683,592.62 |
| RMSLE on the same set | 1.16 | 0.99 | 0.64 |
It’s worth reiterating that these results are only estimates, as the target variable for the test set was not provided. Running the same models on a different subset of the training file produced drastically worse numbers, which can mean two things:
The time-series nature of the data makes subsets of the same file chosen by different criteria non-comparable.
The models are not trained well enough to be useful in practice.
We would need more information from the client to determine specific reasons.
As for the other models, the classification tree achieved 60% accuracy and could be used to further analyze the dependencies among predictors and between predictors and the target variable. Yet to improve its power, we need to reevaluate the binning technique applied to both the feature and the target columns, based on better data and domain understanding.
Clustering did not yield any significant results on either training or test subset. Depending on the feedback from the client, we might want to revisit this unsupervised method along with principal component analysis. It could help us narrow the focus down to an even smaller group of the most significant features. Then we would have to recheck the partition validation.
Association rules provided valuable insights: multiple complex rules (3–5 items) meeting the specified 3.5% support and 60% confidence thresholds were found for just the subset of variables. This reinforces the hypothesis that a smaller number of truly important features affect the target variable. However, the identical support values for many of the top 10 rules raise concerns that require further investigation. This may be related to the equal-frequency binning, so to improve the reliability of the insights from association rules, we will likely need to consider binning based on the target or on client-defined boundary values.
The neural network with 15 hidden nodes fits the training data fairly well, as the respective plot shows, but its pseudo R-squared is the lowest across models. The black-box nature of neural networks makes it harder to analyze the reasons for this performance, so going forward we will need to reconsider the inputs to this technique.
As we can see, random forest does better on every metric except the percentage of variation explained. However, none of the metrics is as strong as desired.
Overall, given the volatile nature of anonymized data, the flexibility and advanced algorithms of random forest make it the model of choice in our recommendation. Analysts shouldn’t depend on a single data mining method but instead should seek a confluence of results from different models. That will provide robustness to the conclusions.
An important observation is the small number of variables that appear to actually matter: the stepwise regression retains only 27 of roughly 70 predictors, the decision trees stop after about 12 splits, and the association rules identify numerous strong relationships between a small subset of the regression variables and the target. Analyzing this further may be the key to separating genuine time-series features from lag variables or artificial noise.
Steps going forward to improve the predictive power:
First, to support the iterative nature of the CRISP-DM process, we will request feedback on the variables and the domains they were derived from.
Second, we will evaluate the actual test RMSLE with the current models.
Based on these findings, we will confirm the time-series aspects of the data.
After that, we will adjust our data preparation methods and revisit clustering and PCA to improve the training set for the models.
So far, only basic transformation methods have been used; we will also need to consider more complex functional relationships between predictors and the target (higher powers, inverse relationships, etc.).
Finally, we will tune the models, prioritizing random forest, on the updated training set and reapply the predictions to the test set. We may also take a closer look at the support values of the association rules after improved discretization.
Additional research questions:
Once we better understand the nature of the data, we will look at potentially missed factors that can influence customer decisions.
Once satisfactory model results are reached, we should analyze existing customer communication methods to determine the most appropriate channels for applying the project findings.